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Abstract 

Purely data driven approaches for machine learning present difficulties when data is scarce relative to the complexity of 
^— «, the model or when the model is forced to extrapolate. On the other hand, purely mechanistic approaches need to identify 

^sj and specify all the interactions in the problem at hand (which may not be feasible) and still leave the issue of how to 

parameterize the system. In this paper, we present a hybrid approach using Gaussian processes and differential equations 
to combine data driven modelling with a physical model of the system. We show how different, physically-inspired, kernel 
functions can be developed through sensible, simple, mechanistic assumptions about the underlying system. The versatility 
of our approach is illustrated with three case studies from motion capture, computational biology and geostatistics. 

^ 1 Introduction 

Traditionally the main focus in machine learning has been model generation through a data driven paradigm. The usual ap- 
proach is to combine a data set with a (typically fairly flexible) class of models and, through judicious use of regularization, 
make predictions on previously unseen data. There are two key problems with purely data driven approaches. Firstly, if data 
is scarce relative to the complexity of the system we may be unable to make accurate predictions on test data. Secondly, if 
the model is forced to extrapolate, i.e. make predictions in a regime in which data has not yet been seen, performance can 

,__! be poor. 

^>- Purely mechanistic models, i.e. models which are inspired by the underlying physical knowledge of the system, are com- 

OS mon in many domains such as chemistry, systems biology, climate modelling and geophysical sciences, etc. They normally 

make use of a fairly well characterized physical process that underpins the system, often represented with a set of differ- 
ential equations. The purely mechanistic approach leaves us with a different set of problems to those from the data driven 
approach. In particular, accurate description of a complex system through a mechanistic modelling paradigm may not be 
possible. Even if all the physical processes can be adequately described, the resulting model could become extremely com- 
plex. Identifying and specifying all the interactions might not be feasible, and we would still be faced with the problem of 

^_l identifying the parameters of the system. 

Despite these problems, physically well characterized models retain a major advantage over purely data driven models. A 

. ^h mechanistic model can enable accurate prediction even in regions where there is no available training data. For example, 

Pioneer space probes can enter different extra terrestrial orbits regardless of the availability of data for these orbits. 
Whilst data driven approaches do seem to avoid mechanistic assumptions about the data, typically the regularization which 
is applied encodes some kind of physical intuition, such as the smoothness of the interpolant. This does reflect a weak 
underlying belief about the mechanism that generated the data. In this sense the data driven approach can be seen as weakly 
mechanistic whereas models based on more detailed mechanistic relationships could be seen as strongly mechanistic. 
The observation that weak mechanistic assumptions underlie a data driven model inspires our approach. We suggest a 
hybrid system which involves a (typically overly simplistic) mechanistic model of the system. The key is to retain sufficient 
flexibility in our model to be able to fit the system even when our mechanistic assumptions are not rigorously fulfilled 
in practise. To illustrate the framework we will start by considering dynamical systems as latent variable models which 
incorporate ordinary differential equations. In this we follow the work of Lawrence et al. (2007} and GaoeFaD (12008) who 
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encoded a first order differential equation in a Gaussian process (GP). However, their aim was to construct an accurate model 
of transcriptional regulation, whereas ours is to make use of the mechanistic model to incorporate salient characteristics of 
the data {e.g. in a mechanical system inertia) without necessarily associating the components of our mechanistic model 
with actual physical components of the system. For example, for a human motion capture dataset we develop a mechanistic 



model of motion capture that does not exactly replicate the physics of human movement, but nevertheless captures salient 
features of the movement. Having shown how linear dynamical systems can be incorporated in a GP, we finally show how 
partial differential equations can also be incorporated for modelling systems with multiple inputs. 

The paper is organized as follows. In section [2] we motivate the latent force model using as an example a latent variable 
model. Section p] employs a first order latent force model to describe how the general framework can be used in practise. 
We then proceed to show three case studies. In section |4] we use a latent force model based on a second order ordinary 
differential equation for characterizing motion capture datasets. Section [5]presents a latent force model for spatio-temporal 
domains applied to represent the development of Drosophila Melanogaster, and a latent force model inspired in a diffusion 
process to explain the behavior of pollutant metals in the Swiss Jura. Extensive related work is presented in section|6] Final 
conclusions are given in section [7] 

2 From latent variables to latent functions 

A key challenge in combining the mechanistic and data-driven approaches is how to incorporate the model flexibility 
associated with the data-driven approach within the mechanism. We choose to do this through latent variables, more 
precisely latent functions: unobserved functions from the system. To see how this is possible we first introduce some well 
known data driven models from a mechanistic latent-variable perspective. 

Let us assume we wish to summarize a high dimensional data set with a reduced dimensional representation. For example, 
if our data consists of N points in a D dimensional space we might seek a linear relationship between the data, Y = 
[yi, . . . , yo] G R NxD with yd G M Arxl , and a reduced dimensional representation, U = [ui, . . . , Uq] G M Arx< ^ with 
u q G M JVxl , where Q < D. From a probabilistic perspective this involves an assumption that we can represent the data as 

Y = UW T +E, (1) 

where E = [ei, . . . , &jj] is a matrix-variate Gaussian noise: each column, e^ G M A ' xl (1 < d < D), is a multi-variate 
Gaussian with zero mean and covariance X, this is, e^ ~ J\f (0, Ed). The usual approach, as undertaken in factor analysis 
and principal component analysis (PCA), to dealing with the unknown latent variables in this model is to integrate out U 
under a Gaussian prior and optimize with respect to W G M r>x< 3 (although it turns out that for a non-linear variant of the 
model it can be convenient to do this the other way around, see for example |Lawrence (2005|>). If the data has a temporal 



nature, then the Gaussian prior in the latent space could express a relationship between the rows of U, u tn = Tu tn _ 1 + f], 
where T is a transformation matrix, tj is a Gaussian random noise and u tn is the n-th row of U, which we associate with 
time t n . This is known as the Kalman filter/smoother. Normally the times, t n , are taken to be equally spaced, but more 
generally we can consider a joint distribution for p (U|t), for a vector of time inputs t = \t\ . . . ijy] , which has the form 
of a Gaussian process, 

Q 
p(U|t) = n^(u,|0,K u „ u J, (2) 

5=1 

where we have assumed zero mean and independence across the Q dimensions of the latent space. The GP makes explicit 
the fact that the latent variables are functions, {u q (t)} =1 , and we have now described them with a process prior. The 
elements of the vector u q = [u q (ti), . . . , u q (tpf)] T , represents the values of the function for the q-th dimension at the times 
given by t. The matrix K u u is the covariance function associated to u q {t) computed at the times given in t. 
Such a GP can be readily implemented. Given the covariance functions for {u q (t)} =1 the implied covariance functions 



for {yd(t)}d=i are straightforward to derive. In Teh et al. (2005 i this is known as a semi-parametric latent factor model 



(SLFM), although their main focus is not the temporal case. If the latent functions u q {t) share the same covariance, but 



are sampled independently, this is known as the multi-task Gaussian process prediction model (MTGP) (Bonilla et al. 



2008 ) with a similar model introduced in Osborne et al. ( 2008 ). Historically the Kalman filter approach has been preferred, 



perhaps because of its linear computational complexity in N. However, recent advances in sparse approximations have 
made the general GP framework practical (see Quinonero-Candela and Rasmussen (2005 ) for a review). 
So far the model described relies on the latent variables to provide the dynamic information. Our main contribution is to 
include a further dynamical system with a mechanistic inspiration. We will make use of a mechanical analogy to introduce 
it. Consider the following physical interpretation of (IT): the latent functions, u q (t), are Q forces and we observe the 
displacement of D springs, yd(t), to the forces. Then we can reinterpret ([T) as the force balance equation, YB = US T + E. 
Here we have assumed that the forces are acting, for example, through levers, so that we have a matrix of sensitivities, 
S G M. Dx Q, and a diagonal matrix of spring constants, B G M r>xr) , with elements {Bd)d=v The original model is 
recovered by setting W T = S T B _1 and e^ ~ A/" (0, B T XdB). With appropriate choice of latent density and noise 



model this physical model underlies the Kalman filter, PCA, independent component analysis and the multioutput Gaussian 
process models we mentioned above. The use of latent variables means that despite this strong physical constraint these 
models are still powerful enough to be applied to a range of real world data sets. We will retain this flexibility by maintaining 
the latent variables at the heart of the system, but introduce a more realistic system by extending the underlying physical 
model. Let us assume that the springs are acting in parallel with dampers and that the system has mass, allowing us to write, 

YM + YC + YB = US + E, (3) 

where M and C are diagonal matrices of masses, {M c i}^ =1 , and damping coefficients, {Cd}^ =1 , respectively, Y is the first 
derivative of Y with respect to time (with entries {yd{t n )} for d = 1, . . . D and n = 1, . . . , N), Y is the second derivative 
of Y with respect to time (with entries {jjd(t n )} f or d = 1, . . . D and n = 1, . . . , N) and E is once again a matrix-variate 
Gaussian noise. Equation (|3]l specifies a particular type of interaction between the outputs Y and the set of latent functions 
U, namely, that a weighted sum of the second derivative for yd(t), iid(t), the first derivative for yd{t), yd{t), an d y<i{t) is 
equal to the weighted sum of functions {u q (t)}^ =l plus a random noise. The second order mechanical system that this 
model describes will exhibit several characteristics which are impossible to represent in the simpler latent variable model 
given by (fill, such as inertia and resonance. This model is not only appropriate for data from mechanical systems. There 
are many analogous systems which can also be represented by second order differential equations, for example Resistor- 
Inductor-Capacitor circuits. A unifying characteristic for all these models is that the system is being forced by latent 
functions, {u q (t)} =r Hence, we refer to them as latent force models (LFMs). This is our general framework: combine a 
physical system with a probabilistic prior over some latent variable. 

One analogy for our model comes through puppetry. A marionette is a representation of a human (or animal) controlled 
by a limited number of inputs through strings (or rods) attached to the character. In a puppet show these inputs are the 
unobserved latent functions. Human motion is a high dimensional data set. A skilled puppeteer with a well designed puppet 
can create a realistic representation of human movement through judicious use of the strings 

3 Latent Force Models in Practise 

In the last section we provided a general description of the latent force model idea and commented how it compares to pre- 
vious models in the machine learning and statistics literature. In this section we specify the operational procedure to obtain 
the Gaussian process model associated to the outputs and different aspects involved in the inference process. First, we illus- 
trate the procedure using a first-order latent force model, for which we assume there are no masses associated to the outputs 
and the damper constants are equal to one. Then we specify the inference procedure, which involves maximization of the 
marginal likelihood for estimating hyperparameters. Next we generalize the operational procedure for latent force models 
of higher order and multidimensional inputs and finally we review some efficient approximations to reduce computational 
complexity. 

3.1 First-order Latent Force Model 

Assume a simplified latent force model, for which only the first derivative of the outputs is included. This is a particular 
case of equation Q, with masses equal to zero and damper constants equal to one. With these assumptions, equation ([3]) 
can be written as 



Y + YB = US + E. (4) 



Individual elements in equation Q follow 

^^ + B d y d (t) = JT S d , q u q {t) + e d {t). (5) 

9=1 

Given the parameters {Bd\d=i an d {^,g}fcin=i' the uncertainty in the outputs is given by the uncertainty coming from 
the set of functions {u q (i)}^ =1 and the noise id(t). Strictly speaking, this equation belongs to a more general set of 
equations known as stochastic differential equations (SDE) that are usually solved using special techniques from stochastic 



calculus (0ksendal 2003). The representation used in equation (|3j is more common in physics, where it receives the name 



of Langevin equations (Reichl 1998). For the simpler equation |5]), the solution is found using standard calculus techniques 
and is given by 

Q 
y d (t) = y d {t )e- Bdt + Y, S d , q Gd[u q }(t) + Q d [e d ]{t), (6) 

9=1 



where y<i(to) correspond to the value of y<i(t) for t = to (or the initial condition) and Qd is a linear integral operator that 
follows 



Gd[v](t) = f d (t,v(t)) = 



,-B d (t-T) 



v(t)(\t. 



Our noise model Gd[sd](t) nas a particular form depending on the linear operator Q&. For example, for the equation in |6]) 
and assuming a white noise process prior for edit), it can be shown that the process Gd[ed]{t) corresponds to the Ornstein- 
Uhlenbeck (OU) process (Reichl, 1998). In what follows, we will allow the noise model to be a more general process and 
we denote it by Wd(t). Without loss of generality, we also assume that the initial conditions {yd(to)}d=i are zero, so that 
we can write again equation (|6) as 



Q 
Ud(t) = ^ S d , q Gd[u q }{t) + w d (t). 

0=1 



(7) 



We assume that the latent functions {u q (t)}^ =1 are independent and each of them follows a Gaussian process prior, this is, 
u q (t) ~ GV(0, k UqMq (t, f))r\ Due to the linearity of Gd, {yd(t)}d=i correspond to a Gaussian process with covariances 

Kd,v d > (*> l ') = cov bd(*)> yJW)} s iven b y 

cov[/<j(*),/d'(*')] +cov[wd(t),Wd'(t')]6d,d', 
where Sd,d' corresponds to the Kronecker delta and cov[/d(i), fd> (t 1 )] is given by 



£S d A g cov[/j(*), /J, (*')], 

9=1 

where we use /J (£) as a shorthand for fd(t, u q (t)). Furthermore, for the latent force model in equation Q, the covariance 
cov[/J(t), /«(«')] is equal to 



Jo Jo 



(8) 



Notice from the equation above that the covariance between f3(t) and fj, (£') depends on the covariance k u >u (r, r'). We 
alternatively denote cov[fd(t), fd> (t')} as k/ d j d , (t, if) and cov[/J(i), /J, (if)\ as kf«ji (t, t'). The form for the covariance 
k u ,u {t,f) is such that we can solve both integrals in equation ([H} and find an analytical expression for the covariance 
kf d j d , (t, if). In the rest of the paper, we assume the covariance for each latent force u q (i) follows the squared-exponential 
(SE) form ( |Rasmussen and W illiams 2006 ) 



k Uq ,u q (t,t') = cxp f- 






where £ q is known as the length-scale. We can compute the covariance fe/« /« (t, if) obtaining (Lawrence et al. 



k fi,aW) 



^„ 



[h d >,d(t',t) + h d ,d>(t,if)}, 



2007) 



(9) 



(10) 



where 



h d >,d{t',t) 



ex P(^,d') 
Bd + Bd> 



exp(-B d >t'){ exp(Bdd) 



et i I — v q ,d> j + erf ( — + Vq t d< 



exp(-B d t) 



erf 



-v q> d' +erf{v q4 >) 



(11) 



where erf(x) is the real valued error function, erf(x) = -j= J exp(— y 2 )dy, and v q ,d = £ q Bd/2. The covariance function 
in equation (jTOJ is nonstationary. For the stationary regime, the covariance function can be obtained by writing if = t + r 
and taking the limit as t tends to infinity. This is, fc,, A L (t) = lim^oo k f i f i (t, t + r). The stationary covariance could 



We can allow a mean prior different from zero. 



also be obtained making use of the power spectral density for the stationary processes u q (t), U q {uj) and the transfer function 
Hd(w) associated to h c i{t — s) = e~ Bd ( l ~ s \ the impulse response of the first order dynamical system. Then applying the 
convolution property of the Fourier transform to obtain the power spectral density of f3(t), F%(u), and finally using the 
Wiener-Khinchin theorem to find the solution for /5(i) ( Shanmugan and Breipohl 1988 1. 

As we will see in the following section, for computing the posterior distribution for {u q (t)}^: =1 , we need the cross- 
covariance between the output yd(t) and the latent force u q {t). Due to the independence between u q {t) and Wd{t), the 
covariance reduces to fc/ d . u (t, t'), given by 



fcW*»0 



^^exp(^)exp(-5 d (i-0) 



01 r I —p v q,d J + erf f y + v q4 



(12) 



3.2 Hyperparameter learning 

We have implicitly marginalized out the effect of the latent forces using the Gaussian process prior for {u q {t)}^ =1 and the 
covariance for the outputs after marginalization is given by k Vd:Vdl (t, t'). Given a set of inputs t and the parameters of the 
covariance function] 2 = ({Bd,}f =1 , {Sd,q}d=i q=n{£q}q=i)' tne marginal likelihood for the outputs can be written as 



p(y|t,0)=W(y|O,K f , f + S), 



(13) 



f € 



iNDxND 



with each element given by cov[/d(4 n ), fd'(t' n ,)) for n — 1, ... ,7V and n' = 



where y = vcc Yr] Kf 

1, .. . , N and X represents the covariance associated with the independent processes Wd(t). In general, the vector of 

parameters 9 is unknown, so we estimate it by maximizing the marginal likelihood. 

For clarity, we assumed that all outputs are evaluated at the same set of inputs t. However, due to the flexibility provided by 

the Gaussian process formulation, each output can have associated a specific set of inputs, this is t^ = [if, . . . , tff ]. 

Prediction for a set of input test t* is done using standard Gaussian process regression techniques. The predictive distribution 

is given by 



p(y*|y,t,0) =^V(y*lA»*,K y . )y J, 



(14) 



with 



M 



K 



y*,y« 



= K f „ f (K f , f + S)- i y, 



S) _1 K7 f . 



Zj* 



where we have used Kf t f t to represent the evaluation of Kf.f at the input set t*. The same meaning is given to the 

covariance matrix Kf^.f. 

As part of the inference process, we are also interested in the posterior distribution for the set of latent forces, 



p(u|y, t, 9) = A/"(u|/x u , y , K u 



Ay) 



(15) 



with 



Mu , y = Kj; u (K f , f + E)- 1 y ) 



K 



u|y 



K UjU - Kl u (K f , f + E)- 1 K f)U , 



where u = vec U, K uu is a block-diagonal matrix with blocks given by K u u . In turn, the elements of K u u are given 
by k Uq , Uq (t, t') in equation |9]), for {t n }n=i- Also Kf, u is a matrix with blocks Kf dU(j , where Kf d . Uq has entries given by 
kfd,u q (t, f) in equation ( |12| l. 

3.3 Higher-order Latent Force Models 

In general, a latent force model of order M can be described by the following equation 



M 



^P m [Y]A m = US T +E, 



(16) 



Also known as hyperparameters. 
3 x = vec X is the vectorization operator that transforms the matrix X into a vector x. The vector is obtained by stacking the columns of the matrix. 



where V m is a linear differential operator such that 2? m [Y] is a matrix with elements given by V m y d (t) = d ^ ' and A m 
is a diagonal matrix with elements A m> d that weights the contribution of V m ijd- 

We follow the same procedure described in section [3~T| for the model in equation ( fT6) > with M = 1. Each element in 
expression ( [To} can be written as 

A/ Q 

Kvd = J2 A m,dD m Vd(t) = J2 S d , q u q (t) + e d (t), (17) 

m— 9—1 

where we have introduced a new operator Vq 1 that is equivalent to apply the weighted sum of operators T> m . For a 
homogeneous differential equation in ( (17} , this is u g (i) = for g = 1, . . . , Q and e d (t) = 0, and a particular set of initial 
conditions {T> m y d (to)}^~Q , ^ i s possible to find a linear integral operator Q d associated to Vq 1 that can be used to solve 
the non-homogeneous differential equation. The linear integral operator is defined as 



Qd[v]{t) = fd(t,v(t)) = I G d (t,T)v(T)dT, (18) 

where G d (t, s) is known as the Green's function associated to the differential operator T>q , v(t) is the input function for the 
non-homogeneous differential equation and T is the input domain. The particular relation between the differential operator 
and the Green's function is given by 

V^[G d (t,a)] = 6(t-s), (19) 

with s fixed, Gd(t, s) a fundamental solution that satisfies the initial conditions and<5(£— s) the Dirac deltajfunction ( Griffel 



2002). Strictly speaking, the differential operator in equation (jT9j is the adjoint for the differential operator appearing in 
equation ( |T7] >. For a more rigorous introduction to Green's functions applied to differential equations, the interested reader 
is referred to |Roach| ( |1982| l. In the signal processing and control theory literature, the Green's function is known as the 
impulse response of the system. Following the general latent force model framework, we write the outputs as 

Q 
y d (t) = J2 S d , q Qd[u q ]{t) + w d (t), (20) 

q=l 

where w d (t) is again an independent process associated to each output. We assume once more that the latent forces fol- 
low independent Gaussian process priors with zero mean and covariance k u iU (t, if). The covariance for the outputs 
k Vd,y d , (*j *') is given by k fd j d , (t, t') + k WdtWd , (t, t')8 d>d >, with k fdJd , (t, if) equal to 



9=1 



,Sd,qS d >, q kf*j'i i {t,i!), (21) 

q=l 

and ktv f q (t,t') following 

J d 'J d' 

J J / G d (t - T)G d , (f - T')k Uq ^ Uq (r, r')dr'dr. (22) 



Learning and inference for the higher-order latent force model is done as explained in subsection 3.2 The Green's function 



is described by a parameter vector tp d and with the length-scales {i q }g—i describing the latent GPs, the vector of hyper- 
parameters is given by 6 = {{ip d }^ = i, {Sd,g}d=i,ci=ii {^q}q=i\- The parameter vector 9 is estimated by maximizing the 
logarithm of the marginal likelihood in equation ( fT3] >, where the elements of the matrix Kf f are computed using expression 
( |2"Tj i with kfi fi (t, t') given by ( |22) , For prediction we use expression ( fT4| i and the posterior distribution is found using 
expression ( |15) , where the elements of the matrix Kf u , kf dtUq (t, t') = kf u (t, t'), are computed using 

S d , q J G d (t- r)k Uq . Uq (r, t')dr. (23) 

In section HI we present in detail a second order latent force model and show its application in the description of motion 
capture data. 



We have used the same notation for the Kronecker delta and the Dirac delta. The particular meaning should be understood from the context. 



3.4 Multidimensional inputs 



In the sections above we have introduced latent force models for which the input variable is one-dimensional. For higher- 
dimensional inputs, x G W, we can use linear partial differential equations to establish the dependence relationships 
between the latent forces and the outputs. The initial conditions turn into boundary conditions, specified by a set of functions 
that are linear combinations of j/d(x) and its lower derivatives, evaluated at a set of specific points of the input space. 
Inference and learning is done in a similar way to the one-input dimensional latent force model. Once the Green's function 
associated to the linear partial differential operator has been established, we employ similar equations to ( f22] ) and (J23j to 
compute kf d y/(x,x') and fe/ d]U (x, x') and the hyperparameters appearing in the covariance function are estimated by 
maximizing the marginal likelihood. In section B] we will present examples of latent force models with spatio-temporal 
inputs and a basic covariance with higher-dimensional inputs. 



3.5 Efficient approximations 

Learning the parameter vector 6 through the maximization of expression ( p"3j ) involves the inversion of the matrix Kf f + S, 
inversion that scales as 0(D 3 N 3 ). For the single output case, this is D = 1, different efficient approximations have been 
introduced in the machine learning literature to reduce computational complexity including |Csato andO pper (2001 ); Seeger 



et al.|fl2003|l;|Quinonero-C andela and Rasmussen (2005); Snelson and Ghahramani (2006); Rasmussen and Williams (2006); 
Titsias (2009 i. Recently, Alvarez and Lawrence! (|2009)) introduced an efficient approximation for the case D > 1, which 



exploits the conditional independencies in equation ( fT8) : assuming that only a few number K < N of values of v(t) are 
known, then the set of outputs fd(t, v(t)) are uniquely determined. The approximation obtained shared characteristics with 



the Pa rtially Independent Training Conditional (PITC) approximation introduced in Quinonero-Candela and Rasmussen 



(2005) and the authors of 



Alvarez and Lawrence 



(2009) refer to the approximation as the PITC approximation for multiple- 



outputs. The set of values {f(i/c)}^ =1 are known as inducing variables, and the corresponding set of inputs, inducing inputs. 

This terminology has been used before for the case in which D = 1. 

A different type of approximation was presented in Alvarez et al. ( 2010 1 based on variational methods. It is a generalization 



of Titsias (2009) for multiple-output Gaussian processes. The approximation establishes a lower bound on the marginal 
likelihood and reduce computational complexity to 0(DNK 2 ). The authors call this approximation Deterministic Training 



Conditional Variational (DTCVAR) approximation for multiple-output GP regression, borrowing ideas from Quinonero- 
|Candela and Rasmussen| ( |2005] ) and |Titsias| ( p009] ). 



4 Second Order Dynamical System 

In Section[T]we introduced the analogy of a marionette's motion being controlled by a reduced number of forces. Human 
motion capture data consists of a skeleton and multivariate time courses of angles which summarize the motion. This motion 
can be modelled with a set of second order differential equations which, due to variations in the centers of mass induced 
by the movement, are non-linear. The simplification we consider for the latent force model is to linearize these differential 
equations, resulting in the following second order system, 



M, 



d 2 y d (t) t _ dy d (t) 



dt 2 



C d - 



dt 



BdVdit) = ^2 s d,qU q {t) + e d (t). 

9=1 



Whilst the above equation is not the correct physical model for our system, it will still be helpful when extrapolating 

predictions across different motions, as we shall see in the next section. Note also that, although similar to d5J, the dynamic 

behavior of this system is much richer than that of the first order system, since it can exhibit inertia and resonance. In what 

follows, we will assume without loss of generality that the masses are equal to one. 

For the motion capture data y d (t) corresponds to a given observed angle over time, and its derivatives represent angular 

velocity and acceleration. The system is summarized by the undamped natural frequency, ujQ d = \f~B d , and the damping 

ratio, C, d — \C d j \[B~ d . Systems with a damping ratio greater than one are said to be overdamped, whereas underdamped 

systems exhibit resonance and have a damping ratio less than one. For critically damped systems ( d = 1, and finally, for 

undamped systems (i.e. no friction) Q = 0. 

Ignoring the initial conditions, the solution of the second order differential equation is given by the integral operator of 

equation ( [18) , with Green's function 

Gd(t, s) = — exp(-ay(t - s)) sm{uj d (t - s)), (24) 



where uj d = y/4B d - C 2 /2 and a d = C d /2. 



CJ,1 



According to the general framework described in section |3.2| the covariance function between the outputs is obtained by 
solving expression ( [22] ), where k UqUq (t, t') follows the SE form in equation ([9]). Solution for kti f* (t, t') is then given by 



(Alvarez et al. 2009) 



Ko[h q (^fd',7d,t,t') + h q (jdnd',t',t) + hqdd' ,ld,t,t') + h q (pjd,ld' ,t',t) 
- h q (f)td',ld,t,t') - h q (j d ,7d> ,*',*) - h q {id>,ld>-t,t') -hgijdi'Yd' ,*',*)] 
where K = t q ^/8u>dUd>, ld = a d + joJd and j d = a d - jw d and the functions h q (%' , j d , t, t') follow 

T g (7 d ,,t',t)-e-^*T g ( 7d ,t',0) 



hqi'Jd' ,7d,t,t') 



Id + Id' 



l W*' \ ( (*-*')M f_(^y 



T q ( 7d ',t,t')=2e\ 4 y e ^'(*-*')- e V ~^Jw(jz d ^ q (t))-eA ^r; e (-W) w( _ iZd (0)), (25) 



with 



and Zd'. q (t) = (t — t')/(. q — (£ q jd')/2- Note that Zd>. q (t) e C, and w(jz) in ( |25) , for z G C, denotes Faddeeva's function 
w(j'z) = exp(z 2 )erfc(z), where erfc(z) is the complex version of the complementary error function, erfc(2;) = 1 — erf(z) = 
-?= J exp(— v 2 )dv. Faddeeva's function is usually considered the complex equivalent of the error function, since \v/(jz) \ 
is bounded whenever the imaginary part of jz is greater or equal than zero, and is the key to achieving a good numerical 
stability when computing ( |2"5] > and its gradients. 
Similarly, the cross-covariance between latent functions and outputs in equation d23) is given by 



fc/>,(M') = eq ^ d ^f [r q (j d ,t,t')-r q ( ld ,t,t% 

Motion Capture data 

Our motion capture data set is from the CMU motion capture data baseF] We considered two different types of movements: 
golf-swing and walking. For golf-swing we consider subject 64 motions 1, 2, 3 and 4, and for walking we consider subject 
35 motions 2 and 3; subject 10 motion 4; subject 12 motions 1, 2 and 3; subject 16, motions 15 and 21; subject 7 motions 
1 and 2, and subject 8 motions 1 and 2. Subsequently, we will refer to the pair subject and motion by the notation X(Y), 
where X refers to the subject and Y to the particular motion. The data was down-sampled by 4n Although each movement 
is described by time courses of 62 angles, we selected only the outputs whose signal-to-noise ratio was over 20 dB as 
explained in appendix[A] ending up with 50 outputs for the golf-swing example and 33 outputs for the walking example. 
We were interested in training on a subset of motions for each movement and testing on a different subset of motions for 
the same movement, to assess the model's ability to extrapolate. For testing, we condition on three angles associated to the 
root nodes and also on the first five and the last five output points of each other output. For the golf-swing, we use leave-one 
out cross-validation, in which one of the 64(F) movements is left aside (with Y = 1, 2, 3 or 4) for testing, while we use the 
other three for training. For the walking example, we train using motions 35(2), 10(4), 12(1) and 16(15) and validate over 
all the other motions (8 in total). 

We use the above setup to train a LFM model with Q = 2. We compare our model against MTGP and SLFM, also with 
Q = 2. For these three models, we use the DTCVAR efficient approximation with K = 30 and fixed inducing-points 
placed equally spaced in the input interval. We also considered a regression model that directly predicts the angles of the 
body given the orientation of three root nodes using standard independent GPs with SE covariance functions. Results for all 
methods are summarized in Table[T]in terms of root-mean-square error (RMSE) and percentage of explained variance (R 2 ). 
In the table, the measure shown is the mean of the measure in the validation set, plus and minus one standard deviation. 
We notice from tablefTlthat the LFM outperforms the other methods both in terms of RMSE and R 2 . This is particularly 
true for the R 2 performance measure, indicating the ability that the LFM has for generating more realistic motions. 

5 Partial Differential Equations and Latent Forces 

So far we have considered dynamical latent force models based on ordinary differential equations, leading to multioutput 
Gaussian processes which are functions of a single variable: time. As mentioned before, the methodology can also be 



5 The CMU Graphics Lab Motion Capture Database was created with funding from NSF EIA-0196217 and is available at http : / /mocap ■ cs ■ emu . | 
|edu] 

''We selected specific frame intervals for each motion. For 64(1), frames [120, 400]; for 64(2), frames [170, 420]; for 64(3), frames [100, 300]; and 
for 64(4), frames [80, 315]. For 35(2), frames [55, 338]; for 10(4), frames [222, 499]; for 12(1), frames [22, 328]; and for 16(15), frames [62, 342]. 
For all other motions, we use all the frames. 



Movement 


Method 


RMSE 


R 2 (%) 


Golf swing 


INDGP 
MTGP 

SLFM 
LFM 


21.55 ±2.35 
21.19 ±2.18 
21.52 ±1.93 
18.09 ±1.30 


30.99 ±9.67 
45.59 ±7.86 
49.32 ±3.03 
72.25 ±3.08 


Walking 


INDGP 

MTGP 

SLFM 

LFM 


8.03 ±2.55 
7.75 ±2.05 
7.81 ±2.00 
7.23 ±2.18 


30.55 ±10.64 
37.77 ±4.53 
36.84 ±4.26 
48.15 ±5.66 



Table 1 : RMSE and R for golf swing and walking 



applied in the context of partial differential equations to recover multioutput Gaussian processes which are functions of 
several inputs. We first show an example of spatio-temporal covariance obtained from the latent force model idea and then 
an example of a covariance function that, using a simplified version of the diffusion equation, allows an expression for 
higher-dimensional inputs. 

5.1 Gap-gene network of Drosophila melanogaster 

In this section we show an example of a latent force model for a spatio-temporal domain. For illustration, we use gene 
expression data obtained from the Gap-gene network of the Drosophila melanogaster. We propose a linear model that can 
account for the mechanistic behavior of the gene expression. 

The gap gene network is responsible for the segmented body pattern of the Drosophila melanogaster. During the blastoderm 
stage of the development of the body, different maternal gradients determine the polarity of the embryo along its anterior- 
posterior (A-P) axis (Perkins et al. , 2006). 




Figure 1 : Drosophila body segmentation genes. Blue stripes correspond to hunchback, green stripes to knirps and red stripes to eve- 
skipped at cleavage cycle 14 A, temporal class 3. 



Maternal gradient interact with the so called trunk gap genes, including hunchback (hb), Kriippel (Kr), giant (gt), and knirps 

(kni), and this network of interactions establish the patterns of segmentation of the Drosophila. 

Figure [T] shows the gene expression of the hunchback, the knirps and the eve-skipped genes in a color-scale intensity image. 

The image corresponds to cleavage cycle 14A, temporal class 3V\ 

The gap-gene network dynamics is usually represented using a set of coupled non-linear partial differential equations 

( [Perkins et aLl[2006HGursky et al.|[2004> 



dy d {x,t) 
dt 



= ((t)P d (y(x, t)) - X d y d (x, t) + B d 



d 2 Vd(x,t) 
dx 2 



where y d (x, t) denotes the relative concentration of gap protein of the <i-th gene at the space point x and time point t. The 
term P d (y(x, t)) accounts for production and it is a function, usually non-linear, of production of all other genes. The 
parameter A^ represents the decay and D^ the diffusion rate. The function £(£) accounts for changes occurring during the 



mitosis, in which the transcription is off (Perkins et al. 2006 1. 

We linearize the equation above by replacing the non-linear term ^{t)P d {y{x, tj) with the linear term J2q=i Sd,qU q (x, t), 

where S d , q are sensitivities which account for the influence of the latent force u q (x, t) over the quantity of production of 



7 The embryo name is dm!2 and the image was taken from http: //urchin . spbcas . ru/flyex/ 



gene d. In this way, the new diffusion equation is given by 

dy d (x,t) v-^ , . , . d 2 y d (x,t) 

7j7 = 2^ ^d,qU q (X, t) - X d y d {X, t) + D d — . 

Vq 

This expression corresponds to a second order non-homogeneous partial differential equation. It is also parabolic with one 
space variable and constant coefficients. The exact solution of this equation is subject to particular initial and boundary 
conditions. For a first boundary value problem with domain < x < I, initial condition y d (x,t = 0) equal to zero, and 



boundary conditions y d (x = 0, t) and y d (x = l,t) both equal to zero, the solution to this equation is given by Polyanin 
( |2002] i; |Bu&ovskiy and Pustyrnikov| | fl^95| >; |Stakgoldl l (T998) 



Vd(x,t) = y2S d ^ q / u q (£,T)G d {x,£,t-T)d£dT, 
, Jo Jo 



9=1 

where the Green's function G d (x, £, t) is given by 

G d (x, 1 1) = -e* J2 sin (-p) S 1 I ' ' 

n— 1 ^ 

We assume that the latent forces u q (x, t) follow a Gaussian process with covariance function that factorizes across inputs 
dimensions, this is 

/ ,, ( (t-t') 2 \ ( (x-x') 2 \ 
k Ug ,u g {x,t,x ,t ) =exp I 1 exp I 1 , 

where £ q represents the length-scale along the time-input dimension and P the length-scale along the space input dimension. 
The covariance for the outputs y d (x, t), k f i f q (x, t, x' , t'), is computed using the expressions for the Green's function and 
the covariance of the latent forces, in a similar fashion to equation (J22J), leading to 

. oo oo 

k fUl (x, t, x', t') = -= V V k) q r (t, t')k x fq , q (x, x'), (26) 

J d'->d' ^ *■ ^ ^ ^ Jd^d' J d^d' 

n—1 Tfl—1 

where fe% fq (t, t') and k x fq fq (x, x') are also kernel functions that depend on the indexes n and to. The kernel function 

J d'J d' ->d>Jd' 

k\ q fq (t, t') is given by expression ( fT0] > and k x q , q (x, x') is given by 

Id-Id' '' Jd'Jd' 

k x fq fq (x,x') — C(n,m,£ x )sm(u! n x)sm(uj m x'), 

where u n — 1J f- and u) m — w ^-. The term C(n, m, £ q ) represents a function that depends on the indexes n and to, and on 

the length-scale of the space-input dimension. The expression for C{n, to, £ x ) is included in appendixlBJ 

For completeness, we also include the cross-covariance between the outputs and the latent functions, which follows as 

kfd,v. q i x iti x ,t ) = y- 2_^ kf d ^ Uq yt,t )kf dUq {x,x ), 

n=l 

where ki u (t, t') is given by expression (fj"2|i, and k x u (x,x') follows 

k %,u q ( x > x ') =sm(w n x)C(x',n 7 £ x ), 

where C(x' , n, £ q ) is a function of x', the index n and the length-scale of the space input dimension. The expression for 
C(x', n, £ x ) is included in appendixlcl 

Prediction of gene expression data 

We want to assess the contribution that a simple mechanistic assumption might bring to the prediction of gene expression 
data when compared to a covariance function that does not imply mechanistic assumptions. 

We refer to the covariance function obtained in the section before as the drosophila (DROS) kernel and compare against the 
multi-task Gaussian process (MTGP) framework already mentioned in section [2] Covariance for the MTGP is a particular 
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case of the latent force model covariance in equations ( f2"T| and ( p2] >. If we make Gd{t — r) = <5(£ — r) in equation ( |2T) , 
and fc u „ (i, £') = k uu (t, t') for all values of g, we get 



kf d ,f d ,(t,t') 



9=1 



(t,f). 



Our purpose is to compare the prediction performance of the covariance above and the DROS covariance function. 

We use data from Perkins et al. (2006), in particular, we have quantitative wild-type concentration profiles for the protein 

products of giant and knirps at 9 time points and 58 spatial locations. We work with a gene at a time and assume that the 

outputs correspond to the different time points. This setup is very common in computer emulation of multivariate codes 

(see |Conti and O'Hagan| ( [20T0| ; [Osborne et al.| ( p008] l; [Rougier| ( [2508l i) in which the MTGP model is heavily used. For the 

DROS kernel, we use 30 terms in each sum involved in its definition, in equation (|26j>. 

We randomly select 20 spatial points for training the models, this is, for finding hyperparameters according to the description 



of subsection 3.2 The other 38 spatial points are used for validating the predictive performance. Results are shown in table 



[2] for five repetitions of the same experiment. It can be seen that the mechanistic assumption included in the GP model 
considerably outperforms a traditional approach like MTGP, for this particular task. 



Gene 


Method 


RMSE 


R 2 (%) 


giant 


MTGP 
DROS 


26.56 ±0.30 
2.00 ±0.35 


81.12 ±0.01 
99.78 ±0.01 


knirps 


MTGP 
DROS 


16.14 ±8.44 
3.01 ±0.81 


91.18 ±2.77 
99.60 ±0.01 



Table 2: RMSE and R for protein data prediction 



5.2 Diffusion in the Swiss Jura 

The Jura data is a set of measurements of concentrations of several heavy metal pollutants collected from topsoil in a 14.5 
km 2 region of the Swiss Jura. We consider a latent function that represents how the pollutants were originally laid down. 
As time passes, we assume that the pollutants diffuse at different rates resulting in the concentrations observed in the data 
set. We use a simplified version of the heat equation of p variables. The p-dimensional non-homogeneous heat equation is 
represented as 



dyd{x,t) 
dt 



^*/^ +*(«.*). 



i=i 



dx] 



where p = 2 is the dimension of x, the measured concentration of each pollutant over space and time is given by 2/d(x, t), 
Kd,j is the diffusion constant of output d in direction p, and $(x, i) represents an external force, with x = {xj}^ =1 . 
Assuming the domain M. p = { — oo < Xj < oo; j = 1, . . . , p} and initial condition prescribed by the set of latent forces, 



i(x) = ^2 Sd,qu q (x), at t - 0, 

9=1 



the solution to the system (Polyanin, 2002) is then given by 

y d (x,t)= f [ G d (x,x',t,r)$(x',r)dx'dr+ f G d (x,x',i,0)w(x')dx', 

JO JMp JRp 

where G q (x, x', t, r) is the Green's function given by 



(27) 



G d (x,x',t,r) 



2^/yiTU^ 



: exp 



(Xj-X'.f 



i=i 



4T dJ 



with Td.j(t, t) = Ka,j(t — t). The covariance function we propose here is derived as follows. In equation p7] ), we assume 
that the external force $(x, t) is zero, following 



Vd(x,t) 



E 

9=1 



S d , q I G d (x,x',i,0)^(x')dx'. 

JRP 



(28) 
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We can write again the expression for the Green's function as 



G d (x,x',i) 



{^Y'\YIU 2T < 



. exp 



'-'■J 



Afe-4) 5 



j=i 



4T rf , 



(27r)P/VlI?=i 



. exp 



E <W 






J'=l 



"J ^ 

2^ 



where ^j = 2T,jj = 2nd,jt. The coefficient ^j is a function of time. In our model for the diffusion of the pollutant 
metals, we think of the data as a snapshot of the diffusion process. Consequently, we consider the time instant of this 
snapshot as a parameter to be estimated. In other words, the measured concentration is given by 



2/d(x) 



Q 

5=1 



G d (x,x')w g (x')dx / , 



(29) 



where Gd(x, x') is the Green's function G<j(x,x',t) that considers the variable t as a parameter to be estimated through 
£d,j- Expression for G<j(x, x') corresponds to a Gaussian smoothing kernel, with diagonal covariance. This is 



G d (x,x') 



|P rf | 1/2 
(2tt)p/2 



exp 



1 



(x-x') T P d (x-x') 



where P d is a precision matrix, with diagonal form and entries {pd.j — j^ }?=i- 

If we take the latent function to be given by a GP with the Gaussian covariance function, we can compute the multiple 

output covariance functions analytically. The covariance function between the output functions, kti ti (x, x'), is obtained 



(27r)^|P« iC£ ,|i 



/2 



exp 



;(x-xV 



p9 

r d,d< 



(x-x') 



where P q dd , — P d + P^, 1 + A 1 , and A q is the precision matrix associated to the Gaussian covariance of the latent force 
Gaussian process prior. The covariance function between the output and latent functions, kti u (x, x'), is given by 



(27r)P/a|P«|i/: 



■ exp 



i(x-x') T (P^(x-x') 



where P^P^+A" 1 . 



Prediction of Metal Concentrations 



We used our model to replicate the experiments described in |Goova erts (pp. 248,249 1997) in which a primary variable 
(cadmium, cobalt, copper and lead) is predicted in conjunction with some secondary variables (nickel and zinc for cadmium 
and cobalt; copper, nickel and zinc for copper and lead)F]Figure0shows an example of the prediction problem. For several 
sample locations we have access to the primary variable, for example cadmium, and the secondary variables, nickel and 
zinc. These sample locations are usually referred to as the prediction set. At some other locations, we only have access to 
the secondary variables, as it is shown in the figure by the squared regions. In geostatistics, this configuration of sample 
locations is known as undersampled or heterotopic, where usually a few expensive measurements of the attribute of interest 
are supplemented by more abundant data on correlated attributes that are cheaper to sample. 

By conditioning on the values of the secondary variables at the prediction and validation sample locations, and the pri- 
mary variables at the prediction sample locations, we can improve the prediction of the primary variables at the validation 
locations. We compare results for the heat kernel with results from prediction using independent GPs for the metals, the 
multi-task Gaussian process and the semiparametric latent factor model. For our experiments we made use of ten repeats to 
report standard deviations. For each repeat, the data is divided into a different prediction set of 259 locations and different 
validation set of 100 locations. Root mean square errors and percentage of explained variance are shown in Tables [3] and HI 
respectively. 

Note from both tables that all methods outperform independent Gaussian processes, in terms of RMSE and explained 
variance. For one latent function (Q = 1), the Gaussian process with Heat kernel render better results than multi-task GPs 
(in this case, the multi-task GP is equivalent to the semiparametric latent factor model). However, when increasing the value 
of the latent forces to two (Q — 2), performances for all methods are quite similar. There is a still a gain in performance 
when using the Heat kernel, although the results are within the standard deviation. Also, when comparing the performances 
for the GP with Heat kernel using one and two latent forces, we notice that both measures are quite similar. In summary, 
the heat kernel provides a simplified explanation for the outputs, in the sense that, using only one latent force, we provide 
better performances in terms of RMSE and explained variance. 



Data available at http : //www. ai-geostats . org/ 
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i Cadmium 
' Nickel 
Zinc 




Figure 2: Sketch of the topsoil of Swiss Jura, 
cadmium, in the squared-regions. 



Secondary variables like nickel and zinc help in the prediction of the primary variable 



Method 


Cadmium (Cd) 


Cobalt (Co) 


Copper (Cu) 


Lead (Pb) 


INDGP 


0.8353 ± 0.0898 


2.2997 ±0.1388 


18.9616 ± 3.4404 


28.1768 ±5.8005 


MTGP (Q = 1) 


0.7638 ±0.1016 


2.2892 ±0.1792 


14.4179 ±2.7119 


21.5861 ±4.1888 


HEATK (Q = 1) 


0.6773 ± 0.0628 


2.06 ± 0.0887 


13.1788 ±2.6446 


17.9839 ± 2.9450 


MTGP (Q = 2) 


0.6980 ± 0.0832 


2.1299 ±0.1983 


12.7340 ±2.2104 


17.9399 ±1.9981 


SLFM (Q = 2) 


0.6941 ± 0.0834 


2.172 ±0.1204 


12.8935 ±2.6125 


17.9024 ±2.0966 


HEATK (Q = 2) 


0.6759 ±0.0623 


2.0345 ±0.0943 


12.5971 ±2.4842 


17.5571 ±2.6076 



Table 3: RMSE for pollutant metal prediction 



Method 


Cadmium (Cd) 


Cobalt (Co) 


Copper (Cu) 


Lead (Pb) 


INDGP 


15.07 ±7.43 


57.81 ±7.19 


25.84 ±7.54 


23.48 ±10.40 


MTGP (Q = 1) 


27.25 ±5.89 


58.45 ±5.71 


58.84 ±8.35 


56.85 ±11.60 


HEATK (Q = 1) 


43.83 ±8.71 


66.19 ±4.60 


65.55 ±8.21 


71.45 ±5.78 


MTGP (Q = 2) 


40.30 ±5.17 


64.13 ±5.10 


67.51 ±8.36 


69.70 ±6.90 


SLFM (Q = 2) 


40.97 ±5.15 


62.49 ±5.41 


67.35 ±8.29 


70.21 ± 6.04 


HEATK (Q = 2) 


43.94 ±6.56 


67.17 ±4.30 


68.40 ±6.46 


70.55 ±6.88 



Table 4: R for pollutant metal prediction 

6 Related work 

Differential equations are the cornerstone in a diverse range of engineering fields and applied sciences. However, their use 
for inference in statistics and machine learning has been less studied. The main field in which they have been used is known 



as functional data analysis (Ramsay and Silverman 2005). 

From the frequentist statistics point of view, the literature in functional data analysis has been concerned with the problem 
of parameter estimation in differential equations (Poyton et al. 2006, Ramsay et al. 2007): given a differential equation 
with unknown coefficients {A m }^f =0 , how do we use data to fit those parameters? Notice that there is a subtle difference 
between those techniques and the latent force model. While these parameter estimation methods start with a very accurate 
description of the interactions in the system via the differential equation (the differential equation might even be non-linear 



( Perkins et al. 2006 )), in the latent force model, we use the differential equation as part of the modeling problem: the 



differential equation is used as a way to introduce prior knowledge over a system for which we do not know the real 
dynamics, but for which we hope some important features of that dynamics could be expressed. Having said that, we 
review some of the parameter estimation methods because they also deal with differential equations with an uncertainty 
background. 
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Classical approaches to fit parameters 9 of differential equations to observed data include numerical approximations of 
initial value problems and collocation methods (references Ramsay et al. ( 2007 1 and Brewer et al. ( 2008 ) provide reviews 
and detailed descriptions of additional methods). 

The solution by numerical approximations include an iterative process in which given an initial set of parameters 9q and a 
set of initial conditions yo, a numerical method is used to solve the differential equation. The parameters of the differential 
equation are then optimized by minimizing an error criterion between the approximated solution and the observed data. For 
exposition, we assume in equation (JT7J that D = 1, Q = 1 and Si t i = 1. We are interested in finding the solution y(t) to 
the following differential equation, with unknown parameters 6 = {A m } 



M 



M 



V^y(t)=J2^ m V m y(t)=u(t), 



In the classical approach, we assume that we have access to a vector of initial conditions, yo and data for u(t), u. We start 
with an initial guess for the parameter vector 6 and solve numerically the differential equation to find a solution y. An 
updated parameter vector 9 is obtained by minimizing 



iV 



E(0) = Y,\mn)~y(tn)\\ 



through any gradient descent method. To use any of those methods, we must be able to compute dE(9)/d9, which is 
equivalent to compute dy(t)/d6. In general, when we do not have access to dy(t)/d0, we can compute it using what 



is known as the sensitivity equations (see Bard ( 1974) , chapter 8, for detailed explanations), which are solved along with 
the ODE equation that provides the partial solution y. Once a new parameter vector 9 has been found, the same steps are 
repeated until some convergence criterion is satisfied. If the initial conditions are not available, they can be considered as 
additional elements of the parameter vector 9 and optimized in the same gradient descent method. 

In collocation methods, the solution of the differential equation is approximated using a set of basis functions, {<j>i(t)}f =1 , 
this is y(t) = J2i=i /%&(*)• The basis functions must be sufficiently smooth so that the derivatives of the unknown func- 
tion, appearing in the differential equation, can be obtained by differentiation of the basis representation of the solution, 
this is, T> m y{t) — J2 @iD m 4>i(fi)' Collocation methods also use an iterative procedure for fitting the additional parameters 
involved in the differential equation. Once the solution and its derivatives have been approximated using the set of basis 
functions, minimization of an error criteria is used to estimate the parameters of the differential equation. Principal differen- 



tial analysis (PDA) (Ramsay 1996) is one example of a collocation method in which the basis functions are splines. In PDA, 
the parameters of the differential equation are obtained by minimizing the squared residuals of the higher order derivative 
V M y(t) and the weighted sum of derivatives {V m y(t)} m lQ, instead of the squared residuals between the approximated 
solution and the observed data. 



An example of a collocation method augmented with Gaussian process priors was introduced by |GraepeI] ( |2003| l. Graepel 
starts with noisy observations, y(t), of the differential equation 2?Q f y(i), such that y(t) ~ Af(V^y(t), a y ). The solution 
y(t) is expressed using a basis representation, y(t) — J2 Pi'Piit)- A Gaussian prior is placed over /3 = [/?i, . . . , /3j], 
and its posterior computed under the above likelihood. With the posterior over j3, the predictive distribution for y(t*) can 
be readily computed, being a function of the matrix Vq 1 ^ with elements {£ , o f( M^™)}n=i i=v K turns out that prod- 
ucts V^^iV^ 1 Q) that appear in this predictive distribution have individual elements that can be written using the 
sum J^j_ 1 X'o f (^i(i n )X'o <^i(t n /) = T>Q t U$ r t , J2i=i 4>i{tn)4>i{tri) or, using a kernel representation for the inner prod- 
ucts k (t n ,t n i) = '^2,' i=1 4>i{t n )(f>i{t n i),as,k T iM v m (£„,£„/), where this covariance is obtained by taking T>q derivatives of 

kit, t') with respect to t and T)^ 1 derivatives with respect to t' . In other words, the result of the differential equation V^y^t) 
is assumed to follow a Gaussian process prior with covariance k V M v m (t, t'). An approximated solution y(t) can be com- 
puted through the expansion y(t) = 5Z„ =1 a n k V M (t, t n ), where a n is an element of the vector (K v m v m + o-yl^^^-y, 



where K v 



t-,m is a matrix with entries kn 

o,t> i o,t' u 



^K^ tr 



and y are noisy observations of V§ y(t) 



Although, we presented the above methods in the context of linear ODEs, solutions by numerical approximations and col- 
location methods are applied to non-linear ODEs as well. 



Gaussian processes have been used as models for systems identification ( |Solaket al . 2003; Kocija net al.| [2005 ; Calderhead 
et al. 2009 Thompson 2009 1. In Solak et al. (2003 1, a non-linear dynamical system is linearized around an equilibrium 
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point by means of a Taylor series expansion ( Thomp son] |2009| l, 



y(t) 



E 

3=0 






(*• 



with a the equilibrium point. For a finite value of terms, the linearization above can be seen as a regression problem in which 
the covariates correspond to the terms (t — a) J and the derivatives j/w) ( a ) as regression coefficients. The derivatives are 
assumed to follow a Gaussian process prior with a covariance function that is obtained as k^' J ) (t, t'), where the superscript 
j indicates how many derivative of k(t, t') are taken with respect to t and the superscript j' indicates how many derivatives of 
kit, t') are taken with respect to t' . Derivatives are then estimated a posteriori through standard Bayesian linear regression. 
An important consequence of including derivative information in the inference process is that the uncertainty in the posterior 
prediction is reduced as compared to using only function observations. This aspect of derivative information have been 



exploited in the theory of computer emulation to reduce the uncertainty in experimental design problems (Morris et al. 
[T993l|Mitchell and Morris|[T99?| . 

Gaussian processes have also been used to model the output y(t) at time tk as a function of i ts L previous samples {y(t — 

The particular dependency 



-l)}h=i> a common setup in the classical theory of systems identification (Ljung 



1999) 



tk 

y(t) = g{{y{t — tk-i)}f = i), where <?(•) is a general non- linear function, is modelled using a Gaussian process prior and the 
predicted value for the output y* (tk) is used as a new input for multi-step ahead prediction at times tj, with j > k (|Kocijan| 
et al. 2005 1. Uncertainty about y* (tk) can also be incorporated for predictions of future output values ( Girard et al. 2003 ). 



On the other hand, multivariate systems with Gaussian process priors have been thoroughly studied in the spatial analysis 
and geostatistics literature (Higdon 2002, Boyle and Frean 2005; Journel and Huijbregts, 1978; Cressie, 1993; Goovaerts 
|1997[|Wack ernagel 2003 jl. In short, a valid covariance function for multi-output processes can be generated using the linear 
model of coregionalization (LMC). In the LMC, each output yd(t) is represented as a linear combination of a series of 
basic processes {u q }^ =1 , some of which share the same covariance function k Uq ^ Uq (t : t'). Both, the semiparametric latent 



factor model dTeh et al. 200511 and the multi-task GP (Bonilla et al. 2008) can be seen as particular cases of the LMC 



(Alvarez et al. 



2011b) 



Higdon (2002) proposed the direct use of a expression ( [T8| l to obtain a valid covariance function 
for multiple outputs and referred to this kind of construction as process convolutions. Process convolutions for constructing 
covariances for single output GP had already been proposed by Barry and Ver Hoef ( 1996| ); Ver Hoef and Barry (1998). 
Calder and Cressie (2007) reviews several extensions of the single process convolution covariance. Boyle and Frean (2005) 
introduced the process convolution idea for multiple outputs to the machine learning audience. Boyle (2007) developed 
the idea of using impulse responses of filters to represent Gd(t, s), assuming the process v(t) was white Gaussian noise. 
Independently, Murray-Smith and Pearlmutter (2005) also introduced the idea of transforming a Gaussian process prior 
using a discretized version of the integral operator of equation ( |18) , Such transformation could be applied for the purposes 
of fusing the information from multiple sensors (a similar setup to the latent force model but with a discretized convolution), 
for solving inverse problems in reconstruction of images or for reducing computational complexity working with the filtered 
data in the transformed space (Shi et al. 



2005) . 



There has been a recent interest in introducing Gaussian processes in the state space formulation of dynamical systems 
(Ko et al. 2007| Deisenroth et al. 20091 Turner et al. 2010| l for the representation of the possible nonlinear relationships 
between the latent space and between the latent space and the observation space. Going back to the formulation of the 
dimensionality reduction model, we have 

u*„ =gl(Ut n _ 1 )+77> 

y*„ =g2(u t J + e, 

where 77 and £ are noise processes and gi(-) and g2(-) are general non-linear functions. Usually gi(-) and g2(-) are 
unknown, and research on this area has focused on developing a practical framework for inference when assigning Gaussian 
process priors to both functions. 

Finally, it is important to highlight the work of Calder (Calder 2003 2007 2008[ ) as an alternative to multiple-output 
modeling. Her work can be seen in the context of state-space models, 



u t „ = u tn _ 1 +77, 
Yt n = G t n u t n +e, 



where y tn and u tn are related through a discrete convolution over an independent spatial variable. This is, for a fixed t Ti 
Vd ( s ) = Eg E* G d n (s - Zj)u g " (zj) for a grid of / spatial inputs {zi}j =1 . 
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7 Conclusion 

In this paper we have presented a hybrid approach to modelling that sits between a fully mechanistic and a data driven ap- 
proach. We used Gaussian process priors and linear differential equations to model interactions between different variables. 
The result is the formulation of a probabilistic model, based on a kernel function, that encodes the coupled behavior of 
several dynamical systems and allows for more accurate predictions. The implementation of latent force models introduced 
in this paper can be extended in several ways, including: 

Non-linear Latent Force Models. If the likelihood function is not Gaussian the inference process has to be accomplished in 
a different way, through a Laplace approximation (Lawrence et al. , 2007) or sampling techniques (Titsias et al. 2009). 
Cascaded Latent Force Models. For the above presentation of the latent force model, we assumed that the covariances 



k u .u {t, t') were squared-exponential. However, more structured covariances can be used. For example, in Honkela et al. 
(2010), the authors use a cascaded system to describe gene expression data for which a first order system, like the one 
presented in subsection 3.1 has as inputs u q (t) governed by Gaussian processes with covariance function (jT0]l. 
Stochastic Latent Force Models. If the latent forces u q (t) are white noise processes, then the correspondi ng differential 



Alvarez et al. 



equations are stochastic, and the covariances obtained in such cases lead to stochastic latent force models. In 
(2010), a first-order stochastic latent force model is employed for describing the behavior of a multivariate financial dataset: 
the foreign exchange rate with respect to the dollar of ten of the top international currencies and three precious metals. 
Switching dynamical Latent Force Models. A further extension of the LFM framework allows the parameter vector to 
have discrete changes as function of the input time. In Alvarez et al. ( 201 la i this model was used for the segmentation of 



movements performed by a Barrett WAM robot as haptic input device. 
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A Preprocessing for the mocap data 

For selecting the subset of angles for each of the motions for the golf-swing movement and the walking movement, we use 
as performance measure the signal-to-noise ratio obtained in the following way. We train a GP regressor for each output, 
employing a covariance function that is the sum of a squared exponential kernel and a white Gaussian noise, 



er| exp 



(x-x' 
W 



■ojf6(x,xf), 



where <r| and aj^ are variance parameters, and <5(x,x') is the Dirac delta function. For each output, we compute the 
signal-to-noise ratio as 101og 10 (cr|/cr^ r ). 

B Expression for C(n,m,£ x ) 

For simplicity, we assume Q = 1 and write Pi = t x . The expression for C(n, m, l x ) is given by 



C(n,m,£ x ) 



(s-p 2 
sin (w n £) sin (w m £') e L ^ J d£'d£. 



The solution of this double integral depends upon the relative values of n and to. If n ^ m, and n and to are both even or 
both odd, then the analytical expression for C(n 1 m, £ x ) is 



'-.7'' 



y/Tr(m 2 



{nX[W(m,4)]- mX[W(n,4)]} 
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where ![■] is an operator that takes the imaginary part of the argument and W(m,£ x ) is given by 



W(m,4) =w(jzj' 



being zj m = ^\ z] m = j- + l -^- and 7 m = juj r , 



-(£)%.-»»' 



v/(jz] m ), 



The term C(n, m, £ x ) is zero if, for n ^ m, n is even and m is odd or viceversa. 
Finally, when n = m, the expression for C(n, n, £ x ) follows as 



1„\FkI 



ft[W(n,4)]-Z[W(n,4)] 



£ n7r i 

-? 1 

2Z 2 



where 1Z[-] is an operator that takes the real part of the argument. 



nir 



<^f 



cos(n7r) — 1 



C Expression for C(x',n,£ x ) 

As in appendixlBJ we assume Q = 1 and £ x = £ x . The expression C(x', n, £ x ) is as 



with zl" x = f x 
argument. 



L x \Fk , 



tin ,7n,^' _ x'-l , lx7n 



e^ l w(jz]"' x ) - e^y^J w(jzj"' x ) 



7„ = jw n and ![•] is an operator that takes the imaginary part of the 
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